VividDream
Generating 3D Scene with Ambient Dynamics

Generate an explorable 3D scene with ambient dynamics from a single input image or text prompt.

Example results: RGB and depth renderings of scenes generated from the text prompt "waterfalls".

Abstract

We introduce VividDream, a method for generating explorable 4D scenes with ambient dynamics from a single input image or text prompt. VividDream first expands an input image into a static 3D point cloud through iterative inpainting and geometry merging. An ensemble of animated videos is then generated using video diffusion models with quality refinement techniques, conditioned on renderings of the static 3D scene from sampled camera trajectories. We then optimize a canonical 4D scene representation using the animated video ensemble, with per-video motion embeddings and visibility masks to mitigate inconsistencies. The resulting 4D scene enables free-view exploration of a 3D scene with plausible ambient dynamics. Experiments demonstrate that VividDream can provide human viewers with compelling 4D experiences generated from diverse real images and text prompts.

Method Overview



Our method takes either a single image or a text prompt as input. For an image input, we generate a caption using BLIP-2 and estimate its depth using DepthAnything. For a text prompt input, we first generate the corresponding image using Stable Diffusion. The image, text prompt, and estimated depth form the initialization for the subsequent stages.
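For concreteness, the initialization can be sketched with off-the-shelf Hugging Face pipelines. The checkpoint IDs below are commonly used public ones and are assumptions for illustration, not necessarily the exact models used in our experiments.

    import torch
    from PIL import Image
    from diffusers import StableDiffusionPipeline
    from transformers import Blip2ForConditionalGeneration, Blip2Processor, pipeline

    device = "cuda"  # fp16 checkpoints below, so a GPU is assumed

    def image_from_text(prompt: str) -> Image.Image:
        # Text-prompt input: synthesize the starting image with Stable Diffusion.
        sd = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
        ).to(device)
        return sd(prompt).images[0]

    def caption_from_image(image: Image.Image) -> str:
        # Image input: caption it with BLIP-2 so later stages also have a text prompt.
        processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
        blip2 = Blip2ForConditionalGeneration.from_pretrained(
            "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
        ).to(device)
        inputs = processor(images=image, return_tensors="pt").to(device, torch.float16)
        out = blip2.generate(**inputs, max_new_tokens=40)
        return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

    def depth_from_image(image: Image.Image) -> Image.Image:
        # Monocular depth with DepthAnything via the depth-estimation pipeline.
        estimator = pipeline("depth-estimation",
                             model="LiheYoung/depth-anything-small-hf", device=0)
        return estimator(image)["depth"]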

In Stage 1, we expand the initial 3D point cloud through an iterative process of novel view inpainting and point cloud merging using aligned depth estimates.
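To make the merge step concrete, the sketch below aligns a monocular depth estimate to the already-reconstructed geometry with a least-squares scale and shift, then unprojects only the newly inpainted pixels and appends them to the point cloud. The function names, camera conventions, and the inpainting/rendering steps it relies on are illustrative assumptions, not our exact implementation.

    import numpy as np

    def align_depth(pred: np.ndarray, ref: np.ndarray, valid: np.ndarray) -> np.ndarray:
        # Least-squares scale s and shift t so that s * pred + t matches the reference
        # depth on pixels where the existing point cloud is visible.
        x, y = pred[valid], ref[valid]
        A = np.stack([x, np.ones_like(x)], axis=1)
        (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
        return s * pred + t

    def unproject(depth: np.ndarray, mask: np.ndarray, K: np.ndarray, c2w: np.ndarray) -> np.ndarray:
        # Lift masked pixels to world space with a pinhole model (camera looking down +z;
        # the axis convention is an assumption).
        v, u = np.nonzero(mask)
        z = depth[v, u]
        x = (u - K[0, 2]) / K[0, 0] * z
        y = (v - K[1, 2]) / K[1, 1] * z
        cam = np.stack([x, y, z, np.ones_like(z)], axis=0)   # 4 x N homogeneous coords
        return (c2w @ cam)[:3].T                             # N x 3 world-space points

    def merge_step(points, colors, render_depth, render_mask, inpainted_rgb, mono_depth, K, c2w):
        # One inpaint-and-merge iteration: align the new depth to the existing geometry,
        # then unproject only the pixels the current point cloud could not cover.
        aligned = align_depth(mono_depth, render_depth, render_mask)
        new_mask = ~render_mask
        new_points = unproject(aligned, new_mask, K, c2w)
        new_colors = inpainted_rgb[new_mask]
        return np.concatenate([points, new_points]), np.concatenate([colors, new_colors])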

Stage 2 focuses on generating the scene motion. We first render K static view-extrapolation videos covering the entire 3D scene, and each video is then animated using Time-Reversal with the static renderings as conditions. To improve video quality, we refine the animated videos using Stable Video Diffusion. However, this refinement may cause the camera motion to deviate from the desired trajectory. We mitigate this issue by applying a smooth transition over the last few frames using FILM so that the video matches the conditioning end view.
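The end-view correction can be pictured as swapping the drifting tail of the refined video for interpolated frames that land exactly on the conditioning end view. In the sketch below, `interpolate` stands in for a FILM wrapper that returns n in-between frames for a pair of frames; its name and signature are assumptions.

    from typing import Callable, List
    import numpy as np

    Frame = np.ndarray  # H x W x 3 image

    def smooth_transition(frames: List[Frame], end_view: Frame, n_tail: int,
                          interpolate: Callable[[Frame, Frame, int], List[Frame]]) -> List[Frame]:
        # Keep the frames that still follow the desired trajectory, rebuild the last
        # `n_tail` frames by interpolating toward the conditioning end view, and make
        # sure the video ends exactly on that view.
        kept = frames[:-n_tail]
        tail = interpolate(kept[-1], end_view, n_tail - 1)
        return kept + tail + [end_view]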

Finally, in Stage 3, we train the 4D scene model, 4DGS, using the animated videos from Stage 2. To handle appearance and motion inconsistencies among the multi-view videos, we apply visibility masks with soft blending weights and introduce a per-video motion embedding inspired by NeRF-W. The resulting 4D scene model enables consistent motion and immersive 4D scene exploration.
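As a rough PyTorch sketch of how the video ensemble could be consumed during training, the snippet below pairs a learnable per-video embedding (in the spirit of NeRF-W's per-image codes) with an L1 loss weighted by a soft visibility mask. The embedding size and the loss form are assumptions for illustration, and the 4DGS renderer itself is outside the snippet.

    import torch
    import torch.nn as nn

    class PerVideoEmbedding(nn.Module):
        # One learnable latent code per training video; the deformation field can be
        # conditioned on this code to absorb per-video motion inconsistencies.
        def __init__(self, num_videos: int, dim: int = 32):
            super().__init__()
            self.table = nn.Embedding(num_videos, dim)

        def forward(self, video_id: torch.Tensor) -> torch.Tensor:
            return self.table(video_id)

    def masked_photometric_loss(pred: torch.Tensor, target: torch.Tensor,
                                visibility: torch.Tensor) -> torch.Tensor:
        # L1 photometric loss weighted per pixel by a soft visibility mask in [0, 1],
        # so regions the static point cloud never covered contribute less.
        per_pixel = (pred - target).abs().mean(dim=-1)        # [H, W]
        return (visibility * per_pixel).sum() / visibility.sum().clamp(min=1.0)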


Scene Motion Generation with Controlled Camera Trajectory

Comparison videos: static-point-cloud rendering along the specified trajectory vs. Stable Video Diffusion.

Naively applying SVD conditioned only on the start view results in uncontrollable camera poses.
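For reference, this naive baseline corresponds to a plain image-to-video call to Stable Video Diffusion conditioned on the rendered start view alone; a minimal diffusers sketch is shown below (the checkpoint ID, resolution, and file names are the usual public defaults, assumed here).

    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import export_to_video, load_image

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        torch_dtype=torch.float16, variant="fp16",
    ).to("cuda")

    # Only the rendered start view is given; nothing constrains the camera to follow
    # the intended trajectory, which is the failure mode illustrated above.
    start_view = load_image("start_view.png").resize((1024, 576))
    frames = pipe(start_view, num_frames=25, decode_chunk_size=8).frames[0]
    export_to_video(frames, "naive_svd.mp4", fps=7)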


Comparisons

Comparison videos: Baseline vs. Ours (full).

The baseline method first generates a single SVD video and then performs 3D reconstruction on that video to train a 4D scene model, leading to many unseen areas and limited free-view synthesis.

Acknowledgements

We extend our gratitude to the developers of Stable Video Diffusion, Stable Diffusion-XL, FILM, DepthAnything, and BLIP-2 for providing their models and code. We also thank the authors of Time-Reversal for their detailed documentation, which was invaluable for our re-implementation.